About the IODS-project

This course is just what I need right now.

We are writing a couple of articles on psychoacoustics and need to use GAM and PCA analyses. This course helps me go deeper into R programming and gives me a better understanding of what my colleague is doing with GAMs (Generalized Additive Models). Earlier MOOC courses in statistics have been interesting and useful, and this is an excellent continuation.

I have a new GitHub repository dedicated to this project:

https://github.com/jussijaatinen/IODS-project


Analysis of Learning 2014 Database


The data were collected with a questionnaire about teaching and learning after the course “Introduction to statistics for social science”, fall 2014. The most essential question groups of the study (deep, surface and strategic learning) have been named and combined so that their connection to these learning dimensions can be seen.


Summary of the dataset

There were twice as many female as male respondents. The dataset has 166 observations and 7 variables.

 gender       Age           Attitude          deep            stra      
 F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
 M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
         Median :22.00   Median :3.200   Median :3.667   Median :3.188  
         Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
         3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
         Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
      surf           Points     
 Min.   :1.583   Min.   : 7.00  
 1st Qu.:2.417   1st Qu.:19.00  
 Median :2.833   Median :23.00  
 Mean   :2.787   Mean   :22.72  
 3rd Qu.:3.167   3rd Qu.:27.75  
 Max.   :4.333   Max.   :33.00  
'data.frame':   166 obs. of  7 variables:
 $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
 $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
 $ Attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
 $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
 $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
 $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
 $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...

Correlations between variables

The three variables with the largest correlations with the dependent variable “Points” were chosen for the model. The independent variables were Attitude (r = 0.44), strategic learning (r = 0.15) and surface learning (r = -0.14). Both genders were combined in the correlation test.
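This variable screening can be sketched in R as follows. The real learning2014 data frame is loaded elsewhere, so a small synthetic stand-in with the same column names is used here just to make the sketch runnable:

```r
# Stand-in for the real learning2014 data frame summarised above.
set.seed(1)
learning2014 <- data.frame(Attitude = rnorm(50), stra = rnorm(50),
                           surf = rnorm(50), Points = rnorm(50))

# Correlation of each candidate predictor with the dependent variable Points.
cors <- sapply(learning2014[c("Attitude", "stra", "surf")],
               cor, y = learning2014$Points)
round(cors, 2)
```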


Regression models

To test which of the chosen independent variables contributed significantly, several alternative regression models were fitted and compared to find the best one.


Call:
lm(formula = learning2014$Points ~ learning2014$Attitude + learning2014$stra + 
    learning2014$surf, data = learning2014)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.1550  -3.4346   0.5156   3.6401  10.8952 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            11.0171     3.6837   2.991  0.00322 ** 
learning2014$Attitude   3.3952     0.5741   5.913 1.93e-08 ***
learning2014$stra       0.8531     0.5416   1.575  0.11716    
learning2014$surf      -0.5861     0.8014  -0.731  0.46563    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.296 on 162 degrees of freedom
Multiple R-squared:  0.2074,    Adjusted R-squared:  0.1927 
F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Call:
lm(formula = learning2014$Points ~ learning2014$Attitude + learning2014$stra, 
    data = learning2014)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.6436  -3.3113   0.5575   3.7928  10.9295 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)             8.9729     2.3959   3.745  0.00025 ***
learning2014$Attitude   3.4658     0.5652   6.132 6.31e-09 ***
learning2014$stra       0.9137     0.5345   1.709  0.08927 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.289 on 163 degrees of freedom
Multiple R-squared:  0.2048,    Adjusted R-squared:  0.1951 
F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

Call:
lm(formula = learning2014$Points ~ learning2014$Attitude + learning2014$surf, 
    data = learning2014)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.277  -3.236   0.386   3.977  10.642 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            14.1196     3.1271   4.515 1.21e-05 ***
learning2014$Attitude   3.4264     0.5764   5.944 1.63e-08 ***
learning2014$surf      -0.7790     0.7956  -0.979    0.329    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.32 on 163 degrees of freedom
Multiple R-squared:  0.1953,    Adjusted R-squared:  0.1854 
F-statistic: 19.78 on 2 and 163 DF,  p-value: 2.041e-08

Call:
lm(formula = learning2014$Points ~ learning2014$Attitude + learning2014$stra + 
    learning2014$gender, data = learning2014)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.7179  -3.3285   0.5343   3.7412  10.9007 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)             8.9798     2.4030   3.737 0.000258 ***
learning2014$Attitude   3.5100     0.5956   5.893 2.13e-08 ***
learning2014$stra       0.8911     0.5441   1.638 0.103419    
learning2014$genderM   -0.2236     0.9248  -0.242 0.809231    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.304 on 162 degrees of freedom
Multiple R-squared:  0.2051,    Adjusted R-squared:  0.1904 
F-statistic: 13.93 on 3 and 162 DF,  p-value: 3.982e-08

Call:
lm(formula = learning2014$Points ~ learning2014$Attitude, data = learning2014)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.9763  -3.2119   0.4339   4.1534  10.6645 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            11.6372     1.8303   6.358 1.95e-09 ***
learning2014$Attitude   3.5255     0.5674   6.214 4.12e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.32 on 164 degrees of freedom
Multiple R-squared:  0.1906,    Adjusted R-squared:  0.1856 
F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

As seen above, Attitude was the only statistically significant variable of those chosen.

Comparing the R-squared (R²) values between the models shows that Attitude alone explains almost as much variance as it does combined with the other two tested variables. Still, it explains only a fraction (< 20%) of the total variance in this data.
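The comparison can be sketched as below. Note that the cleaner formula style `Points ~ Attitude` together with `data = learning2014` avoids repeating the data frame name inside the formula, as happens in the calls above. A synthetic stand-in data frame is used so the sketch runs on its own:

```r
# Stand-in for the real learning2014 data frame.
set.seed(1)
learning2014 <- data.frame(Attitude = rnorm(50), stra = rnorm(50),
                           surf = rnorm(50), Points = rnorm(50))

full  <- lm(Points ~ Attitude + stra + surf, data = learning2014)
small <- lm(Points ~ Attitude, data = learning2014)

# Compare the share of variance explained by each model.
c(full = summary(full)$r.squared, small = summary(small)$r.squared)
```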


Validity of model

To test the validity of the model and the assumption of normally distributed residuals, the following diagnostic plots were produced: Residuals vs Fitted, Normal Q-Q and Residuals vs Leverage. The Residuals vs Fitted plot shows that the residuals are reasonably evenly spread. The Normal Q-Q plot shows that the residuals follow the line well enough (except at the far ends). The Residuals vs Leverage plot shows no outliers in isolated positions far from the mean. Therefore the assumptions of linearity and normally distributed errors can be accepted.
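These diagnostic plots are produced by calling `plot()` on the fitted model; the `which` argument selects the individual plots (1 = Residuals vs Fitted, 2 = Normal Q-Q, 5 = Residuals vs Leverage). A sketch with stand-in data:

```r
# Stand-in for the real learning2014 data frame.
set.seed(1)
learning2014 <- data.frame(Attitude = rnorm(50), stra = rnorm(50),
                           Points = rnorm(50))
model <- lm(Points ~ Attitude + stra, data = learning2014)

# The three diagnostic plots discussed above, on one page.
par(mfrow = c(2, 2))
plot(model, which = c(1, 2, 5))
```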


Analysis of Alcohol consumption in two Portuguese secondary schools


Data, hypotheses and selected explanatory variables

The data for this study were collected from two Portuguese secondary schools using school reports and questionnaires. The aim of this analysis is to find relationships between high alcohol consumption and the hypothesized explanatory variables.

Hypotheses:

  • Males consume more alcohol
  • Alcohol is more commonly consumed with friends outside the home
  • Good family relations decrease drinking
  • Drinking may cause absences

Based on these hypotheses, the selected explanatory variables are sex, going out (goout), family relations (famrel) and absences. Following the first hypothesis, all analyses have been divided into two groups, females and males.

The dataset contains many background variables, such as demographics, social relationships, parents’ education and employment, and living area, as well as grades, absences and health.

A full description of the study is available here: https://archive.ics.uci.edu/ml/datasets/Student+Performance.


All variables and distributions

Observations about selected explanatory variables:

  • sex: student’s sex, 198 females, 184 males (‘F’ = female, ‘M’ = male)
  • goout: going out with friends, fairly normally distributed, median 3 (1 = very low, 5 = very high)
  • famrel: quality of family relationships, negatively skewed, median 4 (1 = very bad, 5 = excellent)
  • absences: number of school absences, multimodal and not normally distributed, median 3 (range 0-45)

For this alcohol consumption study, the original data have been modified by averaging the weekend and weekday alcohol consumption variables. As a result, two new variables have been created: alc_use (Likert scale 1-5, where 1 = very low and 5 = very high) and high_use (logical, TRUE when alc_use > 2).
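The combination can be sketched as follows. In the original UCI data the workday and weekend consumption variables are named `Dalc` and `Walc`; a few stand-in rows are used here so the sketch runs on its own:

```r
# Stand-in rows; in the real data Dalc/Walc are Likert-scale (1-5) answers
# for workday and weekend alcohol consumption.
alc <- data.frame(Dalc = c(1, 2, 4), Walc = c(1, 3, 5))

alc$alc_use  <- (alc$Dalc + alc$Walc) / 2  # average of workday and weekend use
alc$high_use <- alc$alc_use > 2            # TRUE when average use exceeds 2

alc
```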

After these additions the total number of variables is 35 and the number of observations is 382. All variables and their distributions can be seen below.


Distributions and observations of chosen variables

Relationship between alcohol consumption and sex

As seen in the table below, males generally consume more alcohol. In the low-consumption group (very low or low = FALSE), 58% were female and 42% male, whereas in the high-consumption group only 37% were female and 63% male.

     
      FALSE TRUE Sum
  F     156   42 198
  M     112   72 184
  Sum   268  114 382

Relationship between alcohol consumption and going out with friends

Spending more free time outside the home with friends significantly increases alcohol consumption in both sexes.

Relationship between alcohol consumption and absences

Although absences increase with higher alcohol consumption, the difference is quite moderate, and larger in males.

Relationship between alcohol consumption and family relations

Although the same tendency occurs here as well, the association between alcohol consumption and family relations may run in both directions, and which is cause and which is consequence remains unclear. Bad family relationships may lead to higher alcohol consumption, but drinking may also provoke conflicts between family members.


Logistic regression

In binomial logistic regression, the selected explanatory variables (discrete or continuous) explain a binary dependent variable. Here the dependent variable is high alcohol use (high_use), constructed by classifying the aforementioned alc_use variable: values of alc_use up to 2 are classified as low (FALSE) and values above 2 as high (TRUE).
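The model and the odds ratios shown below can be reproduced along these lines. Synthetic stand-in data with the same column names is generated so the sketch runs on its own; the real analysis uses the `alc` data frame:

```r
# Synthetic stand-in for the alc data frame.
set.seed(1)
alc <- data.frame(
  famrel   = sample(1:5, 300, replace = TRUE),
  absences = rpois(300, 4),
  sex      = factor(sample(c("F", "M"), 300, replace = TRUE)),
  goout    = sample(1:5, 300, replace = TRUE)
)
alc$high_use <- runif(300) < 0.3

m <- glm(high_use ~ famrel + absences + sex + goout,
         data = alc, family = "binomial")

# Exponentiating the coefficients gives odds ratios; the same
# transformation puts the confidence intervals on the OR scale.
OR <- exp(coef(m))
CI <- exp(confint(m))
cbind(OR, CI)
```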

All chosen variables (sex, goout, famrel and absences) are statistically significant. Sex is the most significant of all, and its odds ratio is the highest (2.75). This means that the odds of high alcohol consumption are almost three times higher for males than for females. Another strongly influential variable is going out with friends (goout). None of the confidence intervals of the odds ratios include 1, which means that all variables have a meaningful influence in the model.


Call:
glm(formula = high_use ~ famrel + absences + sex + goout, family = "binomial", 
    data = alc)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7151  -0.7820  -0.5137   0.7537   2.5463  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.76826    0.66170  -4.184 2.87e-05 ***
famrel      -0.39378    0.14035  -2.806 0.005020 ** 
absences     0.08168    0.02200   3.713 0.000205 ***
sexM         1.01234    0.25895   3.909 9.25e-05 ***
goout        0.76761    0.12316   6.232 4.59e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 465.68  on 381  degrees of freedom
Residual deviance: 379.81  on 377  degrees of freedom
AIC: 389.81

Number of Fisher Scoring iterations: 4
                    OR      2.5 %    97.5 %
(Intercept) 0.06277092 0.01648406 0.2225470
famrel      0.67449969 0.51060362 0.8869199
absences    1.08510512 1.04012840 1.1353940
sexM        2.75203774 1.66768032 4.6128191
goout       2.15461275 1.70345866 2.7639914

Predictive power of the model

With the chosen variables, the prediction error of the model is about 0.20 (the TRUE-FALSE and FALSE-TRUE cells of the table summed). This means that about 4/5 of the predictions are correct (the TRUE-TRUE and FALSE-FALSE cells summed).

        prediction
high_use      FALSE       TRUE        Sum
   FALSE 0.66492147 0.03664921 0.70157068
   TRUE  0.16753927 0.13089005 0.29842932
   Sum   0.83246073 0.16753927 1.00000000

Mean prediction error

The mean prediction error of the model (the value of the loss function) is determined by calculating the proportion of incorrect predictions.

[1] 0.2041885
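The number above is the value of a 0/1 loss function. A minimal version, following the common convention where `class` is the true outcome (0/1) and `prob` the predicted probability, can be written as:

```r
# Mean prediction error: share of cases where the predicted probability
# falls on the wrong side of 0.5 relative to the true class (0 or 1).
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# One wrong prediction out of four gives an error of 0.25.
loss_func(class = c(1, 0, 1, 0), prob = c(0.9, 0.1, 0.2, 0.4))
```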

K-fold cross-validation

To test the predictive power of the chosen model, a method called cross-validation can be used. In cross-validation the sample data are divided into training and test sets. For example, the data can be split into 5 equally sized parts, with 4/5 used as training data and 1/5 as test data; this is called 5-fold cross-validation. In K-fold cross-validation the data are divided into K parts; here K equals the number of observations (leave-one-out), so the training set contains K-1 observations and the test set a single observation. The procedure is rotated so that every part of the data serves in turn as either training or test data.
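This procedure is available as `cv.glm` in the boot package; leave-one-out corresponds to `K = nrow(data)`. A self-contained sketch with synthetic stand-in data (the real analysis uses the `alc` data frame and the model above):

```r
library(boot)

# Synthetic stand-in data and model.
set.seed(1)
d <- data.frame(x = rnorm(100), z = rnorm(100))
d$y <- runif(100) < plogis(d$x)
m <- glm(y ~ x + z, data = d, family = "binomial")

# 0/1 loss used as the cost function.
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# 10-fold cross-validation; use K = nrow(d) for leave-one-out.
cv <- cv.glm(data = d, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]  # average prediction error over the folds
</imports-placeholder>
```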

In this case, K-fold cross-validation gives a slightly higher prediction error (0.21) than the mean prediction error above.

[1] 0.2146597

10-fold cross-validation on my model

With 10-fold cross-validation, this model has a smaller prediction error (0.21) than the model introduced in DataCamp (about 0.26). This means that the variables chosen for my model predict high alcohol consumption better.

[1] 0.2172775

Cross-validation to compare the performance of different logistic regression models

The full set of my chosen predictors is in the original model introduced earlier. The other models are either reduced combinations of the original model or models with previously unused variables added (N=18). In 10-fold cross-validation, the smallest prediction error (about 0.20) was achieved by my original model. Changing the variables increased or decreased the prediction error depending on the significance of the chosen variables.

                    OR      2.5 %    97.5 %
(Intercept) 0.06277092 0.01648406 0.2225470
famrel      0.67449969 0.51060362 0.8869199
absences    1.08510512 1.04012840 1.1353940
sexM        2.75203774 1.66768032 4.6128191
goout       2.15461275 1.70345866 2.7639914
[1] 0.2015707
                   OR     2.5 %    97.5 %
(Intercept) 0.4877368 0.1702903 1.3635949
famrel      0.7475910 0.5811204 0.9595347
absences    1.0980228 1.0524564 1.1504519
sexM        2.7976292 1.7476499 4.5467291
[1] 0.2565445
                   OR     2.5 %   97.5 %
(Intercept) 0.7187304 0.2621851 1.935500
famrel      0.7857743 0.6158556 1.001830
absences    1.0905430 1.0452818 1.142647
[1] 0.2879581
                   OR     2.5 %    97.5 %
(Intercept) 1.1938851 0.4610715 3.0668110
famrel      0.7673146 0.6041200 0.9734032
[1] 0.3036649
                    OR       2.5 %     97.5 %
(Intercept) 0.01555815 0.005885392 0.03804621
absences    1.08782902 1.042458467 1.13933894
sexM        2.60835292 1.593132148 4.33151387
goout       2.07468346 1.650182481 2.64111050
[1] 0.2198953
                   OR      2.5 %    97.5 %
(Intercept) 0.0917904 0.02572428 0.3085338
famrel      0.7047893 0.53909482 0.9181526
absences    1.0772983 1.03352289 1.1268709
goout       2.1613089 1.71931650 2.7529251
[1] 0.2486911
                   OR     2.5 %    97.5 %
(Intercept) 0.4877368 0.1702903 1.3635949
famrel      0.7475910 0.5811204 0.9595347
absences    1.0980228 1.0524564 1.1504519
sexM        2.7976292 1.7476499 4.5467291
[1] 0.2696335
                   OR     2.5 %    97.5 %
(Intercept) 0.2775097 0.2012914 0.3760343
absences    1.0939748 1.0479459 1.1468072
[1] 0.2879581
                    OR       2.5 %     97.5 %
(Intercept) 0.01555815 0.005885392 0.03804621
absences    1.08782902 1.042458467 1.13933894
sexM        2.60835292 1.593132148 4.33151387
goout       2.07468346 1.650182481 2.64111050
[1] 0.2146597
                    OR      2.5 %    97.5 %
(Intercept) 0.09489664 0.02602013 0.3241397
famrel      0.66248876 0.50299888 0.8679153
sexM        2.54505957 1.56349803 4.1969640
goout       2.21913298 1.76086023 2.8372779
[1] 0.2251309
                   OR     2.5 %    97.5 %
(Intercept) 0.4877368 0.1702903 1.3635949
famrel      0.7475910 0.5811204 0.9595347
absences    1.0980228 1.0524564 1.1504519
sexM        2.7976292 1.7476499 4.5467291
[1] 0.2722513
                   OR     2.5 %    97.5 %
(Intercept) 0.2692308 0.1891905 0.3746607
sexM        2.3877551 1.5268584 3.7716682
[1] 0.2984293
                   OR     2.5 %    97.5 %
(Intercept) 0.8774021 0.3274356 2.3203334
famrel      0.7331474 0.5730317 0.9356179
sexM        2.5191408 1.6005358 4.0108387
[1] 0.2879581
                  OR     2.5 %    97.5 %
(Intercept) 0.159445 0.1012577 0.2427684
absences    1.101409 1.0549317 1.1548057
sexM        2.658116 1.6710354 4.2863129
[1] 0.2670157
                    OR       2.5 %     97.5 %
(Intercept) 0.02208168 0.008754964 0.05186666
sexM        2.39559680 1.484457552 3.91015147
goout       2.14025612 1.708110942 2.71701439
[1] 0.2277487
                   OR      2.5 %    97.5 %
(Intercept) 0.1314829 0.03816207 0.4283774
famrel      0.6930400 0.53161279 0.8999454
goout       2.2081701 1.76229526 2.8045361
[1] 0.2565445
                  OR      2.5 %     97.5 %
(Intercept) 0.026083 0.01076768 0.05922061
absences    1.080069 1.03581041 1.13071158
goout       2.087359 1.66817170 2.64318173
[1] 0.2460733
                    OR      2.5 %     97.5 %
(Intercept) 0.03504298 0.01505792 0.07687797
goout       2.13743726 1.71351154 2.69965289
[1] 0.2696335
                     OR        2.5 %    97.5 %
(Intercept) 0.006582546 8.420856e-05 0.4625185
famrel      0.642908900 4.804204e-01 0.8557823
absences    1.084677514 1.038897e+00 1.1349851
sexM        2.362741264 1.404117e+00 4.0253561
G3          0.978352444 8.988314e-01 1.0653837
freetime    1.143635232 8.645141e-01 1.5163385
age         1.129138861 8.986345e-01 1.4213128
failures    1.194170133 7.676563e-01 1.8609556
famsizeLE3  1.416227556 8.057840e-01 2.4756274
romanticyes 0.627696639 3.507903e-01 1.1001635
health      1.145482751 9.468044e-01 1.3925122
goout       2.042410603 1.593579e+00 2.6564950
[1] 0.2408377


Analysis of Boston data

Overview of data

The Boston dataset is included in the R package MASS as a demonstration example.

The dataset contains social, environmental and economic information about the Greater Boston area. It includes the following variables:

  • crim = per capita crime rate by town
  • zn = proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus = proportion of non-retail business acres per town
  • chas = Charles River dummy variable
  • nox = nitrogen oxides concentration (parts per 10 million)
  • rm = average number of rooms per dwelling
  • age = proportion of owner-occupied units built prior to 1940
  • dis = weighted mean of distances to five Boston employment centres
  • rad = index of accessibility to radial highways
  • tax = full-value property-tax rate per $10000
  • ptratio = pupil-teacher ratio by town
  • black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  • lstat = lower status of the population (percent)
  • medv = median value of owner-occupied homes in $1000s

Structure and the dimensions of the data

The dataset has 506 observations of 14 variables, and all variables are numerical.

'data.frame':   506 obs. of  14 variables:
 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num  6.58 6.42 7.18 7 7.15 ...
 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ black  : num  397 397 393 395 397 ...
 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
[1] 506  14

Summary, graphical presentation of data and correlations

As seen in the pairs plot, most of the variables are not normally distributed: most are skewed and some are bimodal. Correlations between variables are easier to read in the correlation plot, where the largest circles in the upper-right triangle indicate the strongest correlations (blue = positive, red = negative). The corresponding numeric values are mirrored in the lower-left triangle.

      crim                zn             indus            chas        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
 1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
      nox               rm             age              dis        
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
 Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
 Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
      rad              tax           ptratio          black       
 Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
 1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
 Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
 Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
 3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
 Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
     lstat            medv      
 Min.   : 1.73   Min.   : 5.00  
 1st Qu.: 6.95   1st Qu.:17.02  
 Median :11.36   Median :21.20  
 Mean   :12.65   Mean   :22.53  
 3rd Qu.:16.95   3rd Qu.:25.00  
 Max.   :37.97   Max.   :50.00  


Standardization

After standardization the mean of every variable is zero; that is, the variables are distributed around zero. This can be seen in the summary table below (compare with the original summary above).
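Standardization subtracts the column mean and divides by the column standard deviation; in R this is done with `scale()`:

```r
library(MASS)  # provides the Boston dataset

# (x - mean) / sd, column by column.
boston_scaled <- as.data.frame(scale(Boston))

# After scaling, every column mean is (numerically) zero and every sd is one.
round(colMeans(boston_scaled), 10)
apply(boston_scaled, 2, sd)
```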

The crime rate variable has been converted into a categorical variable with 4 levels: low, med_low, med_high and high. Each class contains one quartile of the data (25%).

Training and test sets have been created by randomly dividing the standardized data into two groups: 80% to the training set and 20% to the test set.
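These steps can be sketched as follows; the exact random split naturally differs from the one used in the analysis above:

```r
library(MASS)

boston_scaled <- as.data.frame(scale(Boston))

# Cut the scaled crime rate into quartile-based classes.
bins <- quantile(boston_scaled$crim)
boston_scaled$crime <- cut(boston_scaled$crim, breaks = bins,
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL  # drop the original continuous variable

# Random 80/20 split into training and test sets.
set.seed(123)
n   <- nrow(boston_scaled)
ind <- sample(n, size = floor(0.8 * n))
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
```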

      crim                 zn               indus        
 Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
 1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
 Median :-0.390280   Median :-0.48724   Median :-0.2109  
 Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
 3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
 Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
      chas              nox                rm               age         
 Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
 1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
 Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
 Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
      dis               rad               tax             ptratio       
 Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
 1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
 Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
 Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
     black             lstat              medv        
 Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
 1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
 Median : 0.3808   Median :-0.1811   Median :-0.1449  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
 Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865  

Linear discriminant analysis (LDA)

In the linear discriminant analysis (LDA), only the training set (80% of the data) is analysed. The target variable is the new categorical crime rate variable (low, med_low, med_high, high), and all other variables of the dataset are used as predictors (see Overview of data).
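The model is fitted with `MASS::lda`, using the formula interface with `.` for "all other variables". The data preparation is repeated here so the sketch is self-contained (the random split differs from the run shown below):

```r
library(MASS)

# Prepare the data as described above: scale, classify crime, 80% training set.
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL
set.seed(123)
ind   <- sample(nrow(boston_scaled), size = floor(0.8 * nrow(boston_scaled)))
train <- boston_scaled[ind, ]

# LDA with the crime classes as target and all other variables as predictors.
lda.fit <- lda(crime ~ ., data = train)
lda.fit$scaling  # coefficients of the linear discriminants
```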

The biplot below shows that the variable “rad” (index of accessibility to radial highways) has an extremely high influence on LD1 and LD2 compared to the other variables. In the biplot, horizontal vectors describe contributions to the LD1 dimension (x-axis) and vertical vectors to the LD2 dimension (y-axis). The sign of the linear discriminant coefficient determines the direction of the vector, and the longer the vector, the bigger the influence. Most vectors contribute to both LD1 and LD2, so their directions lie at various angles between the two axes. For example, in the LDA table below, “rad”, the most significant variable for LD1, has coefficients LD1 = 3.09 and LD2 = 1.15, which are directly readable as the coordinates of the arrow head. Similarly “nox” has its arrow head at (0.40, -0.89). LD1 alone explains 95% of the between-group variance, LD2 3.6% and LD3 only 1.4%.

Call:
lda(crime ~ ., data = train)

Prior probabilities of groups:
      low   med_low  med_high      high 
0.2698020 0.2450495 0.2326733 0.2524752 

Group means:
                  zn      indus         chas        nox         rm
low       0.89721928 -0.9307956 -0.163968511 -0.8651928  0.4406050
med_low  -0.08943564 -0.2602144  0.006051757 -0.5695496 -0.1890307
med_high -0.36681929  0.2120771  0.104627338  0.4163107  0.1934465
high     -0.48724019  1.0171096 -0.002135914  1.0701027 -0.4363337
                age        dis        rad        tax       ptratio
low      -0.8610397  0.8714602 -0.6900131 -0.7476016 -0.4273818243
med_low  -0.3419466  0.3885039 -0.5491659 -0.5038891 -0.0009229993
med_high  0.3991542 -0.3838368 -0.4076377 -0.2849484 -0.2172928643
high      0.8233050 -0.8503314  1.6382099  1.5141140  0.7808717661
               black       lstat        medv
low       0.37780299 -0.76751354  0.51594233
med_low   0.31727485 -0.12950181 -0.03322063
med_high  0.06250276 -0.02539694  0.22164262
high     -0.69660204  0.88258910 -0.67182938

Coefficients of linear discriminants:
                 LD1         LD2         LD3
zn       0.097333374  0.60521955 -0.89053411
indus    0.008781588 -0.25139543  0.53736469
chas    -0.063165346 -0.05511780  0.20222976
nox      0.404646983 -0.88580071 -1.29949668
rm      -0.098888298 -0.15588670 -0.25067228
age      0.217000006 -0.22608007 -0.07639453
dis     -0.073969766 -0.26304038  0.27320300
rad      3.085380950  1.14840469  0.12113481
tax      0.100655461 -0.15260023  0.15586438
ptratio  0.096272773 -0.05936972 -0.28929495
black   -0.109628593  0.06389992  0.15247893
lstat    0.227413156 -0.22413918  0.41890885
medv     0.163711164 -0.38879654 -0.14258188

Proportion of trace:
   LD1    LD2    LD3 
0.9500 0.0356 0.0144 

Predictive power of the model

The categorical crime variable has been removed from the test dataset. In the table below, the true classes of the original test data are cross-tabulated against the classes predicted for the test data (crime removed). The total number of observations is 102 (about 20% of 506). The diagonal (from the top-left corner) contains the correctly classified observations (sum = 71) and all other cells the misclassified ones (sum = 31), giving a prediction error of 31/102 ≈ 0.30.

          predicted
correct    low med_low med_high high Sum
  low       13       5        0    0  18
  med_low    6      15        6    0  27
  med_high   0      13       18    1  32
  high       0       0        0   25  25
  Sum       19      33       24   26 102
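The cross-tabulation above is produced roughly as follows; the data preparation and fit are repeated so the sketch runs on its own (the random split differs from the table shown, so the exact counts differ too):

```r
library(MASS)

# Data preparation and LDA fit as described earlier.
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL
set.seed(123)
ind   <- sample(nrow(boston_scaled), size = floor(0.8 * nrow(boston_scaled)))
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
lda.fit <- lda(crime ~ ., data = train)

# Save the true classes, remove them from the test set, then predict.
correct <- test$crime
test$crime <- NULL
pred <- predict(lda.fit, newdata = test)

tab <- table(correct = correct, predicted = pred$class)
addmargins(tab)
1 - sum(diag(tab)) / sum(tab)  # prediction error
```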

Calculation of distances between the observations and optimal number of clusters

A Euclidean distance matrix between the observations has been calculated; a summary of the distances is shown below. The optimal number of clusters can then be investigated with the K-means algorithm: where the TWSS (total within-cluster sum of squares) drops sharply, the number of clusters is close to optimal. In this case the optimal number of clusters is 2 or 3. In the first plot the data are classified into two clusters and in the second into three.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1343  3.4625  4.8241  4.9111  6.1863 14.3970 
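The distance matrix and the TWSS curve can be produced as follows:

```r
library(MASS)

boston_scaled <- as.data.frame(scale(Boston))

# Euclidean distance matrix between all observations.
dist_eu <- dist(boston_scaled, method = "euclidean")
summary(as.vector(dist_eu))

# Total within-cluster sum of squares (TWSS) for 1..10 clusters;
# a sharp drop ("elbow") suggests a good number of clusters.
set.seed(123)
twss <- sapply(1:10, function(k) kmeans(boston_scaled, centers = k)$tot.withinss)
plot(1:10, twss, type = "b", xlab = "number of clusters", ylab = "TWSS")
```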

Bonus

Here LDA is fitted with the k-means clusters as target classes, with all other variables in the Boston data as predictors. The LDA tables and biplots show how the results differ with the number of clusters: depending on the number of clusters, different variables become influential, as seen in the plots.

Call:
lda(clu3 ~ ., data = boston_scaled_new)

Prior probabilities of groups:
        1         2         3 
0.3083004 0.4130435 0.2786561 

Group means:
        crim         zn      indus        nox         rm         age
1  0.8647905 -0.4872402  1.0949412  1.1524404 -0.4717159  0.78718751
2 -0.3763256 -0.3947158 -0.1595373 -0.2992872 -0.2772450  0.02424663
3 -0.3989734  1.1241495 -0.9749470 -0.8314162  0.9328502 -0.90687091
         dis        rad        tax     ptratio      black       lstat
1 -0.8532750  1.3172714  1.3530841  0.57186480 -0.6936034  0.88830984
2  0.0304962 -0.5856776 -0.5482746  0.09789185  0.2769803 -0.03824517
3  0.8988453 -0.5892747 -0.6843385 -0.77780359  0.3568316 -0.92612124
        medv
1 -0.6893319
2 -0.1525634
3  0.9888052

Coefficients of linear discriminants:
                LD1        LD2
crim     0.03403154  0.1907082
zn       0.04097709  0.8724398
indus   -0.41998138 -0.1847419
nox     -1.06730082  0.6654667
rm       0.26776386  0.5104809
age      0.21123453 -0.4670361
dis     -0.07658281  0.3338914
rad     -1.17679746  0.3612912
tax     -0.96644205  0.5042929
ptratio -0.06864005 -0.1601563
black    0.07085932 -0.0359700
lstat   -0.26379244  0.3189355
medv     0.01076674  0.6395889

Proportion of trace:
   LD1    LD2 
0.8729 0.1271 
Call:
lda(clu4 ~ ., data = boston_scaled_new)

Prior probabilities of groups:
        1         2         3         4 
0.4011858 0.1660079 0.3221344 0.1106719 

Group means:
        crim         zn      indus        nox         rm         age
1 -0.3793592 -0.3541732 -0.2369409 -0.3788647 -0.3076659 -0.09191194
2 -0.4124621  1.9031602 -1.0764395 -1.1428600  0.6095020 -1.39546865
3  0.8082769 -0.4872402  1.1165562  1.1413403 -0.4676591  0.79696608
4 -0.3587926 -0.1526454 -0.7764063 -0.2344411  1.5622579  0.10664320
         dis        rad        tax     ptratio      black       lstat
1  0.1379588 -0.5920713 -0.5838167  0.08129234  0.2784401 -0.06159606
2  1.5205894 -0.6250261 -0.5943244 -0.67561813  0.3537454 -0.90687350
3 -0.8539968  1.2199444  1.2927317  0.58616084 -0.6486732  0.87910380
4 -0.2952439 -0.4671119 -0.7549503 -0.98740427  0.3481389 -0.97522401
        medv
1 -0.1690186
2  0.6658984
3 -0.7034406
4  1.6613592

Coefficients of linear discriminants:
                LD1          LD2         LD3
crim    -0.03692074 -0.063843487  0.17437690
zn      -0.10469860 -1.680125461 -0.02034829
indus    0.62305615 -0.375784264 -0.51749840
nox      1.07818726 -0.501408664  0.49444666
rm      -0.13015732 -0.062534323  0.64219861
age     -0.18778510  0.593603356  0.12209172
dis      0.01495296 -0.528845086 -0.16022047
rad      0.71521529  0.091708443  0.26711660
tax      0.86706461 -0.832332218  0.24904935
ptratio  0.21355380 -0.110643988 -0.18716017
black   -0.01780187  0.008088268 -0.03064577
lstat    0.23409949 -0.104352523  0.29032605
medv    -0.14661672  0.063424182  0.97435297

Proportion of trace:
   LD1    LD2    LD3 
0.7143 0.2084 0.0773 
Call:
lda(clu5 ~ ., data = boston_scaled_new)

Prior probabilities of groups:
         1          2          3          4          5 
0.10671937 0.06916996 0.39328063 0.23913043 0.19169960 

Group means:
        crim         zn      indus        nox         rm        age
1 -0.2753323 -0.4872402  1.5337294  1.1273809 -0.6003284  0.9334996
2  1.4802645 -0.4872402  1.0149946  0.9676887 -0.2969389  0.7656016
3 -0.3884901 -0.3308141 -0.4873088 -0.4761310 -0.2318056 -0.1989968
4 -0.3981339  1.2930469 -0.9902994 -0.8283387  1.0566896 -0.9121115
5  0.9128084 -0.4872402  1.0149946  1.0333132 -0.4012324  0.7501117
         dis        rad         tax    ptratio      black      lstat
1 -0.8995039 -0.6096828  0.01485481 -0.3917541 -0.1262348  0.6474932
2 -0.8580043  1.6596029  1.52941294  0.8057784 -3.2970564  1.1699052
3  0.2356027 -0.5732709 -0.60914944  0.1061891  0.3164212 -0.1478389
4  0.9121798 -0.5955687 -0.67325561 -0.8788401  0.3554817 -0.9544043
5 -0.8108798  1.6596029  1.52941294  0.8057784  0.1673459  0.7112530
        medv
1 -0.4618433
2 -1.0473739
3 -0.1012054
4  1.0995470
5 -0.5289453

Coefficients of linear discriminants:
                LD1         LD2         LD3         LD4
crim     0.11743258  0.04770450 -0.14300153 -0.16024852
zn       0.34092540  0.12344732 -0.08107374 -1.23725268
indus    0.53753143 -1.85979858  0.92454526 -1.10944073
nox     -0.00906438 -0.29233935  0.67637180 -0.69383481
rm      -0.05331958  0.27825890 -0.27651832 -0.52558718
age     -0.02982587 -0.24810645  0.11616057  0.17202368
dis     -0.18913313 -0.03732336  0.16315838 -0.26855975
rad      5.88697673  2.04722673  0.02514581  0.50932714
tax      0.13814687  0.14410388  0.09950265 -0.25979580
ptratio  0.20734779  0.06893525 -0.07716785  0.22476182
black   -0.33376654  1.18823488  2.14340012 -0.03614042
lstat   -0.01997449  0.13151481 -0.19758426 -0.25835224
medv    -0.11300541  0.24641561 -0.16457165 -0.62281824

Proportion of trace:
   LD1    LD2    LD3    LD4 
0.8192 0.0840 0.0681 0.0287 
Call:
lda(clu6 ~ ., data = boston_scaled_new)

Prior probabilities of groups:
         1          2          3          4          5          6 
0.30237154 0.12845850 0.21146245 0.09683794 0.05731225 0.20355731 

Group means:
        crim         zn      indus        nox         rm        age
1 -0.3975627 -0.1637000 -0.5873353 -0.6615915 -0.1682976 -0.6574663
2 -0.4140702  2.2813035 -1.1562550 -1.1768217  0.7293996 -1.4086475
3 -0.3231486 -0.4822312  0.6308150  0.5041272 -0.5223445  0.7753457
4 -0.3680230 -0.1494734 -0.7440331 -0.2107180  1.7049351  0.1966439
5  3.0022987 -0.4872402  1.0149946  1.0593345 -1.3064650  0.9805356
6  0.5173303 -0.4872402  1.0149946  1.0036872 -0.1109215  0.6904986
         dis        rad        tax     ptratio       black      lstat
1  0.5448232 -0.5720261 -0.6911565 -0.06248294  0.36026575 -0.3806108
2  1.5645584 -0.6656010 -0.5702843 -0.80946918  0.35416061 -0.9741045
3 -0.5711540 -0.5943977 -0.1915536  0.08745080  0.03673407  0.5492158
4 -0.3113322 -0.5037339 -0.7871603 -1.09274698  0.34883211 -0.9623279
5 -1.0484716  1.6596029  1.5294129  0.80577843 -1.19066142  1.8708759
6 -0.7599982  1.6596029  1.5294129  0.80577843 -0.62752658  0.5406100
         medv
1  0.01419929
2  0.81491803
3 -0.48624639
4  1.73167306
5 -1.31020021
6 -0.48514538

Coefficients of linear discriminants:
                LD1         LD2          LD3         LD4         LD5
crim     0.26905214 -0.04038901  0.952783084  1.05716987  0.83157632
zn       0.02554270 -1.56332967  0.861975978 -1.07176445  0.45885648
indus    0.26308114  0.34083276  0.299613031 -0.64713459 -0.27149224
nox     -0.03462213  0.07174078 -0.011080924 -0.44577196  0.17727824
rm      -0.09400180 -0.18081151 -0.748312466 -0.28307867  0.35366014
age      0.09743790  0.59122392 -0.124953000 -0.52102866  0.63971252
dis     -0.28508543 -0.26165892 -0.004961456  0.09291479 -0.31814880
rad      5.99463004 -1.10320047 -1.528777174  0.82447556 -1.05043782
tax      0.09961052 -0.43262170  0.468017911 -0.81595269  0.38580292
ptratio  0.25749486  0.08897531  0.282642121 -0.34518855 -0.02514549
black   -0.04609198 -0.01324538 -0.083403025  0.01031609 -0.10644781
lstat    0.09758566  0.09732902  0.327335484  0.20747689  0.64677037
medv    -0.06198719 -0.18784402 -0.206662871  0.28964586  0.98112363

Proportion of trace:
   LD1    LD2    LD3    LD4    LD5 
0.8529 0.0830 0.0298 0.0185 0.0159 

Super-Bonus

The data points are, of course, in the same positions in both plots; only the colour coding differs. In the main group the two colourings differ only slightly. The clearest difference is in the separate group: when the colours are coded by crime rate, the high-crime observations are well isolated in a single group, whereas the cluster colouring splits the same points into two clusters.

Train data classified by Crime (1 = low, 4 = high)
Train data classified by Clusters
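One simple way to quantify how the two colourings relate is to cross-tabulate the crime classes against the cluster labels. A small numpy sketch with hypothetical labels (not the actual Boston assignments), where the high-crime class deliberately splits across two clusters, as observed above:

```python
import numpy as np

# Hypothetical labelings for the same ten points: crime class (1-4)
# and cluster id (1-4). The contingency table shows their overlap.
crime = np.array([1, 1, 2, 2, 3, 3, 4, 4, 4, 4])
cluster = np.array([1, 1, 1, 2, 2, 2, 3, 3, 4, 4])

table = np.zeros((4, 4), dtype=int)
for c, k in zip(crime, cluster):
    table[c - 1, k - 1] += 1
print(table)   # rows: crime class, columns: cluster
```

Here row 4 (high crime) is split between clusters 3 and 4, mirroring the two high-crime clusters seen in the plot.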

PCA analysis of human data

Overview of data

Data for this analysis comes from the United Nations Development Programme. Two datasets have been combined, the Human Development Index (HDI) and the Gender Inequality Index (GII). The resulting dataset includes the following variables:

  • Life_Exp = Life expectancy at birth in years
  • Edu_Exp = Expected years of education
  • GNI = Gross National Income per capita
  • Mat_Mortal = Maternal mortality (deaths per 100,000 live births)
  • Teen_Preg = Teen pregnancies (births per 1,000 women)
  • F_Rep_Parl = Female members of parliament (%)
  • Edu_Rto_FM = Ratio of females to males in education
  • Work_Rto_FM = Ratio of females to males in the labour force

Most of the variables are somewhat skewed. High positive and negative correlations can be seen throughout the correlation matrix; in fact, most of the variables in this dataset correlate strongly with each other. Only F_Rep_Parl and Work_Rto_FM correlate weakly with all the other variables.
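The analysis itself was done in R, but the idea of the correlation matrix is easy to sketch in numpy with hypothetical data: one strong negative relationship (in the spirit of Life_Exp vs Mat_Mortal) and one unrelated variable (in the spirit of F_Rep_Parl):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins, not the actual human data:
life_exp = rng.normal(70, 8, 200)
mat_mortal = 1000 - 12 * life_exp + rng.normal(0, 30, 200)  # strong negative link
f_rep_parl = rng.normal(20, 9, 200)                         # unrelated

corr = np.corrcoef(np.column_stack([life_exp, mat_mortal, f_rep_parl]),
                   rowvar=False)
print(np.round(corr, 2))
```

The first two variables show a correlation close to -1, while the third correlates weakly with both.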

    Life_Exp        Edu_Exp           GNI           Mat_Mortal    
 Min.   :49.00   Min.   : 5.40   Min.   :   581   Min.   :   1.0  
 1st Qu.:66.30   1st Qu.:11.25   1st Qu.:  4198   1st Qu.:  11.5  
 Median :74.20   Median :13.50   Median : 12040   Median :  49.0  
 Mean   :71.65   Mean   :13.18   Mean   : 17628   Mean   : 149.1  
 3rd Qu.:77.25   3rd Qu.:15.20   3rd Qu.: 24512   3rd Qu.: 190.0  
 Max.   :83.50   Max.   :20.20   Max.   :123124   Max.   :1100.0  
   Teen_Preg        F_Rep_Parl      Edu_Rto_FM      Work_Rto_FM    
 Min.   :  0.60   Min.   : 0.00   Min.   :0.1717   Min.   :0.1857  
 1st Qu.: 12.65   1st Qu.:12.40   1st Qu.:0.7264   1st Qu.:0.5984  
 Median : 33.60   Median :19.30   Median :0.9375   Median :0.7535  
 Mean   : 47.16   Mean   :20.91   Mean   :0.8529   Mean   :0.7074  
 3rd Qu.: 71.95   3rd Qu.:27.95   3rd Qu.:0.9968   3rd Qu.:0.8535  
 Max.   :204.80   Max.   :57.50   Max.   :1.4967   Max.   :1.0380  

Principal component analysis (PCA)

In principal component analysis the original multidimensional data is transformed into a new space; in most cases the number of dimensions is reduced at the same time. The new dimensions are called principal components, and they are uncorrelated with each other. The first principal component explains the maximum amount of variance in the original data. The second component is orthogonal to the first and explains the maximum amount of the remaining variance. The same principle continues with the following components, each of which is less important than the previous one. If the eigenvalue of a component is smaller than one (with standardized data this means it explains less variance than a single original variable), it can usually be left out (see the PCA tables below). In biplots, PC1 is usually drawn on the x-axis and PC2 on the y-axis. Arrows and labels show the connections between the original features and the principal components (usually PC1 and PC2).

In biplots the angles between the arrows correspond to the correlations of the original features: the smaller the angle, the higher the positive correlation. If two arrows are parallel but point in opposite directions, this indicates a strong negative correlation. A 90-degree angle means zero correlation (orthogonal; compare to the x-axis (PC1) and y-axis (PC2), which are uncorrelated by construction). The lengths of the arrows are proportional to the standard deviations of the original features.
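These mechanics can be sketched from scratch (in numpy, for illustration; the actual analysis uses R). On standardized data the principal components are the eigenvectors of the correlation matrix, ordered by eigenvalue; the data below is purely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X[:, 1] += 0.9 * X[:, 0]                  # make two features correlate

Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize
eigval, eigvec = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigval)[::-1]          # largest variance first
eigval, eigvec = eigval[order], eigvec[:, order]

scores = Z @ eigvec                       # data in the new space
explained = eigval / eigval.sum()         # proportion of variance per PC
print(np.round(explained, 3))
```

The resulting components are uncorrelated, each explaining no more variance than the previous one, and with standardized data an eigenvalue below one marks a component carrying less variance than a single original variable.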


Call:
PCA(X = human) 


Eigenvalues
                       Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
Variance               4.288   1.299   0.766   0.607   0.438   0.288
% of var.             53.605  16.237   9.571   7.583   5.477   3.595
Cumulative % of var.  53.605  69.842  79.413  86.996  92.473  96.069
                       Dim.7   Dim.8
Variance               0.211   0.104
% of var.              2.634   1.298
Cumulative % of var.  98.702 100.000

Individuals (the 10 first)
                  Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
Norway        |  3.957 |  3.094  1.440  0.611 |  1.896  1.786  0.230 |
Australia     |  3.489 |  3.003  1.357  0.741 |  1.192  0.706  0.117 |
Switzerland   |  3.146 |  2.706  1.101  0.740 |  0.900  0.402  0.082 |
Denmark       |  3.430 |  2.790  1.171  0.662 |  1.779  1.572  0.269 |
Netherlands   |  3.240 |  2.736  1.126  0.713 |  1.474  1.079  0.207 |
Germany       |  2.959 |  2.521  0.956  0.726 |  1.325  0.872  0.200 |
Ireland       |  2.883 |  2.637  1.046  0.837 |  0.363  0.066  0.016 |
United States |  2.652 |  2.164  0.704  0.666 |  0.463  0.106  0.030 |
Canada        |  2.595 |  2.270  0.775  0.766 |  1.037  0.534  0.160 |
New Zealand   |  3.010 |  2.503  0.943  0.692 |  1.273  0.805  0.179 |
               Dim.3    ctr   cos2  
Norway        -0.057  0.003  0.000 |
Australia      0.107  0.010  0.001 |
Switzerland    0.276  0.064  0.008 |
Denmark       -0.172  0.025  0.003 |
Netherlands   -0.309  0.081  0.009 |
Germany       -0.390  0.128  0.017 |
Ireland        0.643  0.348  0.050 |
United States  0.879  0.651  0.110 |
Canada         0.294  0.073  0.013 |
New Zealand   -0.015  0.000  0.000 |

Variables
                 Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
Life_Exp      |  0.919 19.689  0.844 | -0.029  0.064  0.001 | -0.096
Edu_Exp       |  0.886 18.290  0.784 |  0.159  1.943  0.025 |  0.064
GNI           |  0.726 12.284  0.527 |  0.058  0.256  0.003 |  0.176
Mat_Mortal    | -0.905 19.094  0.819 |  0.165  2.105  0.027 |  0.110
Teen_Preg     | -0.852 16.913  0.725 |  0.088  0.594  0.008 | -0.017
F_Rep_Parl    |  0.175  0.712  0.031 |  0.742 42.428  0.551 | -0.634
Edu_Rto_FM    |  0.739 12.719  0.545 |  0.043  0.144  0.002 |  0.212
Work_Rto_FM   | -0.113  0.298  0.013 |  0.826 52.465  0.682 |  0.511
                 ctr   cos2  
Life_Exp       1.208  0.009 |
Edu_Exp        0.539  0.004 |
GNI            4.068  0.031 |
Mat_Mortal     1.568  0.012 |
Teen_Preg      0.039  0.000 |
F_Rep_Parl    52.572  0.403 |
Edu_Rto_FM     5.868  0.045 |
Work_Rto_FM   34.139  0.261 |

Importance of components:
                             PC1      PC2   PC3   PC4   PC5   PC6    PC7
Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
                          PC8
Standard deviation     0.1591
Proportion of Variance 0.0000
Cumulative Proportion  1.0000
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
100   0   0   0   0   0   0   0 

Standardization of data and PCA

Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analysis (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomenons they relate to. (0-4 points)

Because PCA is sensitive to scaling, a feature with a large variance gets greater importance than a feature with a smaller variance. Standardizing the original data equalizes the importance of the original features.

In the table and biplot below, the PCA analysis itself is the same as above, but the original data has been standardized before the PCA is calculated.
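A small numpy sketch (hypothetical data, for illustration) shows why this matters: without standardization one large-scale feature, like GNI here, swallows nearly all the variance, just as in the unstandardized importance table above.

```python
import numpy as np

def pca_explained(X):
    """Proportion of total variance carried by each principal component."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values
    return s**2 / np.sum(s**2)

rng = np.random.default_rng(2)
# Hypothetical features on very different scales (a GNI-like column
# versus life-expectancy-like and education-like columns):
gni_like = rng.normal(17000, 18000, 300)
life_like = rng.normal(70, 8, 300)
edu_like = rng.normal(13, 3, 300)
X = np.column_stack([gni_like, life_like, edu_like])

raw = pca_explained(X)                                     # unscaled PCA
std = pca_explained((X - X.mean(axis=0)) / X.std(axis=0))  # standardized
print(np.round(raw, 3))   # PC1 dominated by the large-scale feature
print(np.round(std, 3))   # variance spreads across the components
```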

    Life_Exp          Edu_Exp             GNI            Mat_Mortal     
 Min.   :-2.7188   Min.   :-2.7378   Min.   :-0.9193   Min.   :-0.6992  
 1st Qu.:-0.6425   1st Qu.:-0.6782   1st Qu.:-0.7243   1st Qu.:-0.6496  
 Median : 0.3056   Median : 0.1140   Median :-0.3013   Median :-0.4726  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.6717   3rd Qu.: 0.7126   3rd Qu.: 0.3712   3rd Qu.: 0.1932  
 Max.   : 1.4218   Max.   : 2.4730   Max.   : 5.6890   Max.   : 4.4899  
   Teen_Preg         F_Rep_Parl        Edu_Rto_FM       Work_Rto_FM     
 Min.   :-1.1325   Min.   :-1.8203   Min.   :-2.8189   Min.   :-2.6247  
 1st Qu.:-0.8394   1st Qu.:-0.7409   1st Qu.:-0.5233   1st Qu.:-0.5484  
 Median :-0.3298   Median :-0.1403   Median : 0.3503   Median : 0.2316  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.6030   3rd Qu.: 0.6127   3rd Qu.: 0.5958   3rd Qu.: 0.7350  
 Max.   : 3.8344   Max.   : 3.1850   Max.   : 2.6646   Max.   : 1.6632  


Call:
PCA(X = human_std) 


Eigenvalues
                       Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
Variance               4.288   1.299   0.766   0.607   0.438   0.288
% of var.             53.605  16.237   9.571   7.583   5.477   3.595
Cumulative % of var.  53.605  69.842  79.413  86.996  92.473  96.069
                       Dim.7   Dim.8
Variance               0.211   0.104
% of var.              2.634   1.298
Cumulative % of var.  98.702 100.000

Individuals (the 10 first)
                  Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
Norway        |  3.957 |  3.094  1.440  0.611 |  1.896  1.786  0.230 |
Australia     |  3.489 |  3.003  1.357  0.741 |  1.192  0.706  0.117 |
Switzerland   |  3.146 |  2.706  1.101  0.740 |  0.900  0.402  0.082 |
Denmark       |  3.430 |  2.790  1.171  0.662 |  1.779  1.572  0.269 |
Netherlands   |  3.240 |  2.736  1.126  0.713 |  1.474  1.079  0.207 |
Germany       |  2.959 |  2.521  0.956  0.726 |  1.325  0.872  0.200 |
Ireland       |  2.883 |  2.637  1.046  0.837 |  0.363  0.066  0.016 |
United States |  2.652 |  2.164  0.704  0.666 |  0.463  0.106  0.030 |
Canada        |  2.595 |  2.270  0.775  0.766 |  1.037  0.534  0.160 |
New Zealand   |  3.010 |  2.503  0.943  0.692 |  1.273  0.805  0.179 |
               Dim.3    ctr   cos2  
Norway        -0.057  0.003  0.000 |
Australia      0.107  0.010  0.001 |
Switzerland    0.276  0.064  0.008 |
Denmark       -0.172  0.025  0.003 |
Netherlands   -0.309  0.081  0.009 |
Germany       -0.390  0.128  0.017 |
Ireland        0.643  0.348  0.050 |
United States  0.879  0.651  0.110 |
Canada         0.294  0.073  0.013 |
New Zealand   -0.015  0.000  0.000 |

Variables
                 Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
Life_Exp      |  0.919 19.689  0.844 | -0.029  0.064  0.001 | -0.096
Edu_Exp       |  0.886 18.290  0.784 |  0.159  1.943  0.025 |  0.064
GNI           |  0.726 12.284  0.527 |  0.058  0.256  0.003 |  0.176
Mat_Mortal    | -0.905 19.094  0.819 |  0.165  2.105  0.027 |  0.110
Teen_Preg     | -0.852 16.913  0.725 |  0.088  0.594  0.008 | -0.017
F_Rep_Parl    |  0.175  0.712  0.031 |  0.742 42.428  0.551 | -0.634
Edu_Rto_FM    |  0.739 12.719  0.545 |  0.043  0.144  0.002 |  0.212
Work_Rto_FM   | -0.113  0.298  0.013 |  0.826 52.465  0.682 |  0.511
                 ctr   cos2  
Life_Exp       1.208  0.009 |
Edu_Exp        0.539  0.004 |
GNI            4.068  0.031 |
Mat_Mortal     1.568  0.012 |
Teen_Preg      0.039  0.000 |
F_Rep_Parl    52.572  0.403 |
Edu_Rto_FM     5.868  0.045 |
Work_Rto_FM   34.139  0.261 |

Importance of components:
                          PC1    PC2     PC3     PC4     PC5     PC6
Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
                           PC7     PC8
Standard deviation     0.45900 0.32224
Proportion of Variance 0.02634 0.01298
Cumulative Proportion  0.98702 1.00000
 PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3 

Interpretations of PC1 and PC2

Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data.

Based on the variable loadings above, PC1 can be read as an overall human-development dimension: Life_Exp, Edu_Exp, GNI and Edu_Rto_FM load strongly positively on it, while Mat_Mortal and Teen_Preg load strongly negatively. PC2 is dominated by F_Rep_Parl and Work_Rto_FM, so it describes female participation in political and working life, a dimension largely independent of the wealth-and-health dimension.

Tea

Explore the data briefly: look at the structure and the dimensions of the data and visualize it.

Six columns were kept for the analysis: Tea, How, how, sugar, where and lunch. The subset has 300 observations, and all six variables are factors.

        Tea         How                      how           sugar    
 black    : 74   alone:195   tea bag           :170   No.sugar:155  
 Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
 green    : 33   milk : 63   unpackaged        : 36                 
                 other:  9                                          
                  where           lunch    
 chain store         :192   lunch    : 44  
 chain store+tea shop: 78   Not.lunch:256  
 tea shop            : 30                  
                                           
'data.frame':   300 obs. of  6 variables:
 $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
 $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
 $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
 $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
 $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
 $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...

Multiple Correspondence Analysis on the tea data

Then do Multiple Correspondence Analysis on the tea data (or to a certain columns of the data, it’s up to you). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots. (0-4 points)
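As background, MCA can be seen as correspondence analysis applied to the indicator (one-hot) matrix of the categorical columns. A tiny hypothetical sketch of building that matrix (numpy, for illustration; the analysis itself uses FactoMineR's MCA in R):

```python
import numpy as np

# Hypothetical mini version of the tea data: two categorical variables.
tea = ["black", "Earl Grey", "green", "Earl Grey"]
sugar = ["sugar", "No.sugar", "No.sugar", "sugar"]

def indicator(values):
    """One-hot encode a categorical variable: one column per level."""
    levels = sorted(set(values))
    return levels, np.array([[1 if v == lev else 0 for lev in levels]
                             for v in values])

tea_levels, Z_tea = indicator(tea)
sugar_levels, Z_sugar = indicator(sugar)
Z = np.hstack([Z_tea, Z_sugar])   # full indicator matrix: 4 rows, 5 columns
print(Z)
```

Every row sums to the number of variables (here two), and each variable's block of columns sums to one per row; this is the structure that correspondence analysis then decomposes.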


Call:
MCA(X = tea_time, graph = FALSE) 


Eigenvalues
                       Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
Variance               0.279   0.261   0.219   0.189   0.177   0.156
% of var.             15.238  14.232  11.964  10.333   9.667   8.519
Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
                       Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
Variance               0.144   0.141   0.117   0.087   0.062
% of var.              7.841   7.705   6.392   4.724   3.385
Cumulative % of var.  77.794  85.500  91.891  96.615 100.000

Individuals (the 10 first)
                      Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
                      ctr   cos2  
1                   0.163  0.104 |
2                   0.735  0.314 |
3                   0.062  0.069 |
4                   0.068  0.073 |
5                   0.062  0.069 |
6                   0.062  0.069 |
7                   0.062  0.069 |
8                   0.735  0.314 |
9                   0.007  0.003 |
10                  0.643  0.261 |

Categories (the 10 first)
                       Dim.1     ctr    cos2  v.test     Dim.2     ctr
black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
                      cos2  v.test     Dim.3     ctr    cos2  v.test  
black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |

Categorical variables (eta2)
                     Dim.1 Dim.2 Dim.3  
Tea                | 0.126 0.108 0.410 |
How                | 0.076 0.190 0.394 |
how                | 0.708 0.522 0.010 |
sugar              | 0.065 0.001 0.336 |
where              | 0.702 0.681 0.055 |
lunch              | 0.000 0.064 0.111 |
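The eta2 values above are correlation ratios: the share of a dimension's variance explained by grouping the individuals by a categorical variable. A minimal numpy sketch of the correlation ratio on hypothetical coordinates:

```python
import numpy as np

def eta_squared(coords, groups):
    """Correlation ratio: between-group variance / total variance."""
    coords = np.asarray(coords, dtype=float)
    grand = coords.mean()
    total = np.sum((coords - grand) ** 2)
    between = sum(
        coords[groups == g].size * (coords[groups == g].mean() - grand) ** 2
        for g in np.unique(groups)
    )
    return between / total

# Hypothetical dimension coordinates, cleanly separated by group A/B:
groups = np.array(["A"] * 5 + ["B"] * 5)
coords = np.array([1.0, 1.1, 0.9, 1.2, 0.8, -1.0, -1.1, -0.9, -1.2, -0.8])
print(round(eta_squared(coords, groups), 3))  # close to 1: strong separation
```

A value near 1 means the variable's categories separate cleanly along that dimension (as where and how do on Dim.1 here); a value near 0 means the dimension tells nothing about that variable.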